knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(shiny))
options(digits = 2)
knitr::knit_hooks$set(inline = function(x) {
knitr:::format_sci(x, "md")
})
We are exploring an already cleaned Gapminder dataset. Where we will extract relevant and influential information from the data to help us get an idea of what is happening within the dataset. We have been provided with particular questions the researcher would like us to look into. A first half of the analysis will be guided, as in the researcher has provided what they want for us to do. The latter portion of the analysis will be done unguided. Questions will be provided, however, how we go about our analysis will be dependent upon us.
The data was downloaded from Gapminder.org which is
a website that takes data from multiple sources and joins them together
to give you a combined and centralized time-series data source. The
particular data we are looking at comes from (what I believe to be) the
GDP per capita in constant PPP dollars data. This data set
explores the Gross Domestic Product (GDP) per capita of each country by
year.
# load the data into a tibble
data <- read_csv("gapminder_clean.csv")
data <- dplyr::rename(data, "index" = "...1")
To begin we have been asked to look at the year 1962. In particular the researcher would like us to graph and check the correlation between the GDP per cap of each country and the amount of CO2 emissions they expelled between the two columns. We can first subset our data to only include the year 1962. Once we have our subset we can then simply create a scatter plot that includes a linear model line applied to it. This way we can see if the two variables seem to be correlated.
# filter data to Year 1962 only
data.year <- filter(data, Year == 1962)
# scatter plot
ggplot(data.year, aes(`CO2 emissions (metric tons per capita)`, `gdpPercap`)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "CO2 emissions vs gdpPercap Scatter Plot w/ lm fit model\nYear filtered to only 1962")
# generate the correlation
cor.year <- cor.test(data.year$`CO2 emissions (metric tons per capita)`, data.year$gdpPercap)
# assign them to variables
p.value <- cor.year$p.value
corr <- cor.year$estimate[[1]]
Now that we see how the two columns look compared against each other for the year of 1962. We can see by the linear model, (represented by a line in the plot), it looks like the two variables have a high correlation between the two of them. Though visually they seem to be highly correlated, one cannot draw conclusions from this plot. Instead we ran a correlation test on the two columns for a quantitative answer to the question. The correlation between the two columns is 0.93 and a \(p\)-value of 1.13× 10-46. This tells us that our data is correlated and is very significant.
We are curious to see what year, within our data, has the highest
correlation between CO2 emission and GDP per capita. We can do that by
grouping the data by Year and then running a correlation test on each
group’s CO2 emission by GDP per cap columns. We can turn that into a
data.frame and then sort it to grab the highest correlated
Year.
# Grab the highest year
high.corr <- data %>%
group_by(Year) %>%
summarise(estimate = cor.test(`CO2 emissions (metric tons per capita)`, gdpPercap)$estimate) %>%
slice_max(order_by = estimate) %>%
inner_join(data, by = "Year")
highest.year <- high.corr$Year[1]
Once we run this correlation test we find out that the year 1967 has the highest correlation between CO2 emissions and GDP per cap. We are asked to produce an interactive scatter plot for the Year, so that we can better explore the data within this year.
ggplotly(ggplot(high.corr, aes(x = `CO2 emissions (metric tons per capita)`, y = `gdpPercap`)) +
geom_point(aes(color = continent, size = pop)) +
labs(title = "CO2 emission vs gdpPercap grouped by continent\nSubset by Year 1967"))
Since we are now going to analyze these next questions on our own,
lets first get some metrics of the total dataset. We have 2607 rows in
and 20 columns in our data. There are a total of 14651 NA points within
our data. That is 28.1% of the data. This is a sizeable chunk of our
data. Which means during our analysis we may have to remove or fill in
the NA values.
The researcher would like to know if there is a relationship between
continent and
Energy use (kg of oil equivalent per capita). We know that
Energy use (kg of oil equivalent per capita) is a
continuous variable and continent is a categorical
variable. Let’s first get a sense of our continuous variable, we can
plot the histogram of the column and see how it looks.
# Check the distribution of the data
hist(data$`Energy use (kg of oil equivalent per capita)`, main = "Histogram of All Energy Use", xlab = "Energy use (kg of oil equivalent per capita)", breaks = 100)
Looking at the histogram above, we can see the right skew of our continuous dependent variable. The histogram shows us that we have a lot of zero and low values in our data. We can also see that there are quite a few outliers within our data.
Now that we have a sense of the dependent variable. We can look at
our independent variable which is categorical. However, if we check to
see the amount of NAs in the column we can see that there
are a lot of missing values in continent. There are 1323
NA values in the continent column. Though within the
Country Name column we have no missing values. Which means
that we could download a dataset that associates
Country Name with their continent. This would
allow us to fill in hopefully a large portion of the data, with minimal
effort.
# import country df
continent.df <- read_csv("continents2.csv")
continent.df <- rbind(continent.df, list("Arab World", NA, NA, NA, NA, "Asia", NA, NA, NA, NA, NA))
filled_data <- data %>% left_join(continent.df[, c("name", "region")], by = c("Country Name" = "name"))
Once we bring in the other data frame and use it to fill in a new
column called region we get 774 NA values.
Which is a huge improvement from before. Let’s graph our
Energy use (kg of oil equivalent per capita) column by the
newly filled in continent column. While also getting rid of the NA rows
within region, so they don’t confuse the analysis.
# Barplot grouped by continent
sub <- data[with(data, !is.na(`Energy use (kg of oil equivalent per capita)`) &
!is.na(continent)), ]
ggplotly(ggplot(sub, aes(x = continent, y = `Energy use (kg of oil equivalent per capita)`)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Energy use by continent, 1962-2007"))
We can see from the plot above that the data is almost Gaussian looking with Asia being the highest Energy user with the Americas and Europe almost tied for second. We can visually see that there is a large difference between the different continents. Though we should apply an ANOVA test to this data in order to get an actual quantitative value to see if there is a significant difference between groups.
# anova model
aov.model <- aov(`Energy use (kg of oil equivalent per capita)` ~ continent, data = sub)
p.value <- summary(aov.model)[[1]][["Pr(>F)"]][1]
When we run our data through an ANOVA model we can see that we get a significant difference between Energy use by continent. With a \(p\)-value of 8.53× 10-39. However, when we run an ANOVA test, we only have a high-level overview of the data. It may be more informative for us to look and see which comparisons were significant and which group comparisons weren’t. We can run a Tukey HSD for our multiple comparison test.
# tukeyHSD setting margins for the plot
par(mar = c(6, 8, 3, 1))
plot(TukeyHSD(aov.model), las = 1)
Looking at the output from the Tukey HSD model, we can see that the only non-significant comparisons are Oceania-Asia, Oceania-Europe, and Oceania-Americas. All other comparisons are significant. This is interesting as Oceania was the second lowest Energy Use of the five. The significant difference that is observed for Oceania is when it is compared to it’s fellow low Energy user Africa.
We have been asked to check the difference between Europe and Asia
after the year 1990 in regard to
'Imports of goods and services (% of GDP)'. The first thing
we need to do is pull out the data from the year 1990 and after. Then we
can sort by continent and make sure to only grab Europe and Asia.
# Subset data frame
after.1990 <- data %>% filter(Year >= 1990 & continent %in% c("Europe", "Asia"))
Once we subset we get 224 rows within our subset. Now that we have our data subset we can visualize the difference between the two continents and run a welche’s two sample t-test. We have two independent continuous groups making the t-test the ideal model to compare the means of these two groups.
# first graph boxplot
ggplotly(ggplot(after.1990, aes(x = continent, y = `Imports of goods and services (% of GDP)`)) +
geom_boxplot() +
labs(title = "Boxplot to compare the mean's of Asia and Europe"))
# Then t-test
test.out <- t.test(`Imports of goods and services (% of GDP)` ~ continent, data = after.1990)
We can see from the graph above that the means are extremely close to each other, indicating that most likely our two groups are not significantly different from each other. Though we should quantitatively describe this relationship with the mean’s of the two groups. Asia’s mean is 46.85 and Europe’s mean is 41.79. The \(p\)-value from the t.test is 0.18, which is not significant at all. This confirms what we are seeing in the boxplot.
We have been tasked with finding out which of the countries has the
highest average
'Population density (people per sq. km of land area)'
across all years. We can find that out by creating a subset of the data
by year and then grab the countries that have the max amount for that
year. Then you can create a barplot to visualize the top countries and
get a frequency table to figure out which of the countries on average is
high throughout the years.
# Grab the max pops for each year
avg_country <- data %>%
group_by(Year) %>%
slice_max(order_by = `Population density (people per sq. km of land area)`)
# plotly graph
ggplotly(ggplot(avg_country, aes(x = Year, y = `Population density (people per sq. km of land area)`, fill = `Country Name`)) +
geom_bar(stat = "identity") +
labs(title = "Max Population Density by Year since 1962") +
theme(plot.title = element_text(hjust = 0.5)))
Looking at the bar graph above, we can see that the top population
density per year oscillates between only two options. Macao SAR, China
(red) and Monaco (blue). We can see that Macao SAR, China reached a
higher population in general. However, if we look at the number of times
the two appear, we can see that they both appear five times. Meaning
that for the answer to the question which country/countries have the
highest average
'Population density (people per sq. km of land area)' we
have both Macao SAR, China and Monaco as the two highest average
population density across all of the available years in the dataset.
Lastly we want to know which of the countries in our dataset has
shown the greatest increase in
'Life expectancy at birth, total (years)' since the
beginning of our dataset (1962). We can find that out by creating a
separate dataset that has only the year 1962 (lower bound of the Year in
our data) and 2007 (upper bound of the Year in our data). Once we have
this subset we can simply subtract the upper bound minus the lower bound
and get our change over time.
# grab the difference between 2007 and 1962 in a dataframe
y <- data %>%
filter(Year %in% c(1962, 2007) & !is.na(`Life expectancy at birth, total (years)`)) %>%
group_by(`Country Name`) %>%
filter(n() == 2) %>%
mutate(diff_nums = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962]) %>%
filter(Year == 2007)
# order data
order.total <- y[order(y$diff_nums, decreasing = TRUE), ]
top.five <- order.total[1:5, ]
# Data
ggplotly(ggplot(y, aes(x = reorder(`Country Name`, diff_nums), y = diff_nums)) +
geom_bar(stat = "identity") +
labs(
title = "Country Life Expectancy Increase from 1962-2007",
x = "Countries", y = "Increase of Life Expectancy"
) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
))
knitr::kable(top.five[, c("Country Name", "diff_nums")])
| Country Name | diff_nums |
|---|---|
| Maldives | 37 |
| Bhutan | 33 |
| Timor-Leste | 31 |
| Tunisia | 31 |
| Oman | 31 |
After taking the difference between the Life Expectancy of 1962 to 2007 for each country. We then sorted them and pulled out the top five countries. We can see from the table and the graph that Maldives is the highest increasing country in Life Expectancy between the years of 1962 and 2007. What is more interesting from the graph is the countries on the far left that have actually gotten worse in life expectancy.
top.full.five <- data %>% filter(`Country Name` %in% c(
"Maldives", "Bhutan",
"Timor-Leste", "Tunisia", "Oman", "Nepal"
))
ggplotly(ggplot(top.full.five, aes(
x = Year, y = `Life expectancy at birth, total (years)`,
color = `Country Name`, group = `Country Name`
)) +
geom_line() +
geom_point() +
labs(title = "Life Expectancy Increasing from 1962 to 2007"))
If we graph the top five we can see that the Maldives just barely beats out some of the other top five. However, Maldives has the largest change over the years.
Exploring the Gapminder data we saw that there was a significant correlation between CO2 emissions and gdpPercap in the year 1962. Though we ultimately found out that the year 1967 had the highest correlation between CO2 emissions and gdpPercap. We explored Energy use between continents and showed that there was a significant difference between the Energy use between them. We saw that there was no difference between Asia and Europe and the Imports of goods and services. Over the years we figured out that Monaco and Macao SAR, China went back and forth between having the highest Population density over the years 1962 to 2007. Then finally we concluded that the Maldives has had the highest increase in Life Expectancy between 1962 to 2007 out of the countries present within our dataset.